Scaling out data preprocessing with Hive
نویسندگان
چکیده
We introduce a user-friendly graphical data preprocessing application based on Hive, one of the well known opensource distributed warehouse systems. It is comfortable, and easy to use for preprocessing purposes, but to prove usability of this application, we created measurement a framework to ensure precise results. These results show that our application has outstanding scaling capability in the case of increasing data amount and increasing number of applied computers. We conclude that this is a good solution for medium and large scale data preprocessing.
منابع مشابه
Data Quality for Web Log Data Using a Hadoop Environment
Solving data quality problems is important for data warehouse construction and operation. This paper is based on developing a web log warehouse. It proposes a data quality problem methodology for data preprocessing within the log warehouse. It provides a hierarchical data warehouse architecture that is suitable for resource saving and ad hoc requirements. The data preprocessing is completed usi...
متن کاملEnhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining
This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...
متن کاملARPN Journal of Science and Technology::Analysis of Movie Lens Data Set using Hive
Large scale data set provides the better opportunity to find out much better data relationship in the area of business intelligence. In the paper, we implement our systems using Hadoop that has been popular to store and compute Big Data. However, it is not easy to write Hadoop Map Reduce code. Therefore, we use Hive and Hive QL codes to understand the relationships between ratings and the users...
متن کاملThe HIVE Tool for Informed Swarm State Space Exploration
Swarm verification and parallel randomised depth-first search are very effective parallel techniques to hunt bugs in large state spaces. In case bugs are absent, however, scalability of the parallelisation is completely lost. In recent work, we proposed a mechanism to inform the workers which parts of the state space to explore. This mechanism is compatible with any action-based formalism, wher...
متن کاملExecution Primitives for Scalable Joins and Aggregations in Map Reduce
Analytics on Big Data is critical to derive business insights and drive innovation in today’s Internet companies. Such analytics involve complex computations on large datasets, and are typically performed on MapReduce based frameworks such as Hive and Pig. However, in our experience, these systems are still quite limited in performing at scale. In particular, calculations that involve complex j...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011